Interaction Terms (Part 2)

STA6235: Modeling in Regression

Introduction

  • Recall the general linear model,

y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon

  • Last lecture, we began talking about interactions and focused on continuous \times continuous interactions.

    • e.g., x_1 \times x_2
  • Today, we will begin talking about interactions with categorical variables.

Interactions with Categorical Variables

  • Recall that if a categorical predictor with c classes is included in the model, we will include c-1 terms to represent it.

  • This holds true for interactions:

    • Categorical \times categorical: (c_1-1)(c_2-1)

    • Categorical \times continuous: (c-1)(1)

  • Note that a special (and easy!) case is when our categorical variable is binary: c-1 = 1.

  • Consider factor A, with 3 levels, and factor B, with 4 levels.

    • (3-1)(4-1) = 2 \times 3 = 6 interaction terms in the model 😬
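
As a quick check of the (c_1-1)(c_2-1) count, the sketch below builds a design matrix for two made-up factors (A with 3 levels, B with 4 levels; the names and level labels are hypothetical, not part of today's data) and counts the interaction columns.

```r
# Hypothetical factors: A has 3 levels, B has 4 levels (labels are made up)
d <- expand.grid(A = factor(c("a1", "a2", "a3")),
                 B = factor(c("b1", "b2", "b3", "b4")))

# Design matrix with both main effects and the A x B interaction
X <- model.matrix(~ A * B, data = d)

# Interaction columns are the ones whose names contain ":"
interaction_cols <- grep(":", colnames(X), value = TRUE)
length(interaction_cols)  # (3 - 1) * (4 - 1) = 6
```

The remaining columns are the intercept, 2 terms for A, and 3 terms for B.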

Today’s Data

library(tidyverse)
library(fastDummies)
ratings <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-25/ratings.csv')
details <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-25/details.csv')
analytic <- full_join(ratings, details, by = "id") %>% 
  filter(playingtime <= 300 & # ≤ 300 min
         playingtime > 0 & # > 0 min
         year > 1900 & # filter out games without years
         year < 2024 & # filter out games with implausibly late years
         minplayers > 0) %>% # require at least 1 player for game
  mutate(play60 = playingtime/60) %>%
  select(id, name, year, average, play60, minplayers) %>%
  mutate(year2013 = if_else(year >= 2013, 1, 0),
         play_hours = case_when(play60 <= 1 ~ 1,
                                play60 > 1 & play60 <= 2 ~ 2,
                                play60 > 2 & play60 <= 3 ~ 3,
                                play60 > 3 & play60 <= 4 ~ 4,
                                play60 > 4 & play60 <= 5 ~ 5),
         play_home = if_else(minplayers <= 2, 1, 0)) %>%
  dummy_cols(select_columns = "play_hours") %>%
  na.omit()
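
The case_when() step above places play60 into one-hour bins that are open on the left and closed on the right: (0, 1], (1, 2], ..., (4, 5]. A minimal base R sketch of the same binning, using a made-up vector of play times (play60_toy is hypothetical, not the real data):

```r
# Made-up play times in hours, including values that sit exactly on a bin edge
play60_toy <- c(0.25, 1, 1.5, 2, 2.01, 3, 4.5, 5)

# Because every value is > 0 and <= 5, rounding up gives the bin number
bins_ceiling <- ceiling(play60_toy)

# cut() with right-closed intervals (0,1], (1,2], ..., (4,5] gives the same bins
bins_cut <- as.integer(cut(play60_toy, breaks = 0:5, right = TRUE))

data.frame(play60_toy, bins_ceiling, bins_cut)
```

Either version only matches the case_when() coding because the earlier filter() keeps play times strictly above 0 and at most 300 minutes.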

Example 1 - Model

  • Let’s model the average rating as a function of whether the game was made in the last 10 years (year2013), whether I can play it at home (play_home), the length of game play (play_hours - categorical!), the interaction between play_home and year2013, and the interaction between play_home and play_hours.
m1 <- lm(average ~ year2013 + play_home + play_hours_2 + play_hours_3 + play_hours_4 + play_hours_5 +
           play_home:year2013 + # interaction between play_home and year2013
           play_home:play_hours_2 + play_home:play_hours_3 + play_home:play_hours_4 + play_home:play_hours_5, # interaction between play_home and play_hours
         data = analytic) 
summary(m1)

Call:
lm(formula = average ~ year2013 + play_home + play_hours_2 + 
    play_hours_3 + play_hours_4 + play_hours_5 + play_home:year2013 + 
    play_home:play_hours_2 + play_home:play_hours_3 + play_home:play_hours_4 + 
    play_home:play_hours_5, data = analytic)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.1332 -0.4586  0.0415  0.5160  2.9060 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             5.95235    0.02100 283.487  < 2e-16 ***
year2013                0.45949    0.02929  15.685  < 2e-16 ***
play_home              -0.05833    0.02292  -2.545   0.0109 *  
play_hours_2            0.33735    0.03970   8.498  < 2e-16 ***
play_hours_3            0.72737    0.08349   8.712  < 2e-16 ***
play_hours_4            0.88137    0.13125   6.715 1.93e-11 ***
play_hours_5            1.27010    0.17131   7.414 1.28e-13 ***
year2013:play_home      0.30495    0.03168   9.626  < 2e-16 ***
play_home:play_hours_2  0.17735    0.04259   4.164 3.14e-05 ***
play_home:play_hours_3  0.06271    0.08760   0.716   0.4741    
play_home:play_hours_4  0.02057    0.13563   0.152   0.8794    
play_home:play_hours_5 -0.43971    0.18238  -2.411   0.0159 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7804 on 19897 degrees of freedom
Multiple R-squared:  0.2501,    Adjusted R-squared:  0.2497 
F-statistic: 603.3 on 11 and 19897 DF,  p-value: < 2.2e-16
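
Because play_home appears in both interactions, its estimated effect is not a single number. Collecting the play_home terms from the output above (rounded to three decimals) gives a sketch of the estimated difference in average rating between play_home = 1 and play_home = 0, holding the other predictors fixed:

```latex
\hat{\Delta}_{\text{play\_home}} = -0.058 + 0.305\,\text{year2013} + 0.177\,\text{play\_hours\_2} + 0.063\,\text{play\_hours\_3} + 0.021\,\text{play\_hours\_4} - 0.440\,\text{play\_hours\_5}
```

For example, for a game made before 2013 with at most one hour of play, the estimated difference is -0.058; for a game made in 2013 or later with more than one and up to two hours of play, it is -0.058 + 0.305 + 0.177 = 0.424.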

Testing Categorical \times Categorical Interactions

  • As we see in the model, a categorical \times categorical interaction results in (c_1-1)(c_2-1) terms.

    • In our example, play_home \times play_hours results in 4 terms.
  • If we want to know if the interaction - overall - is significant, then we must perform the partial F test.

    • The reduced model removes only the terms related to the specific interaction we are interested in.

      • e.g., in our example, we would remove play_home:play_hours_2, play_home:play_hours_3, play_home:play_hours_4, play_home:play_hours_5 to determine if the interaction between play_home and play_hours is significant.
  • Note that in the case of binary \times binary or binary \times continuous interactions, we can use the results from summary().
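
To make the partial F test concrete: the statistic compares the error sums of squares of the two models, F_0 = [(SSE_reduced - SSE_full) / (df_reduced - df_full)] / [SSE_full / df_full]. The sketch below checks this against anova() on simulated data; the variables b, g, and y and the models full_sim and reduced_sim are made up for illustration and are not today's data.

```r
set.seed(6235)
n <- 200
g <- factor(sample(c("a", "b", "c"), n, replace = TRUE))  # hypothetical 3-level factor
b <- rbinom(n, 1, 0.5)                                    # hypothetical binary predictor
y <- 1 + 0.5 * b + 0.3 * (g == "b") + rnorm(n)            # simulated response

full_sim    <- lm(y ~ b * g)  # main effects + (2 - 1)(3 - 1) = 2 interaction terms
reduced_sim <- lm(y ~ b + g)  # interaction terms removed

# Partial F statistic computed by hand from the two residual sums of squares
sse_r <- sum(resid(reduced_sim)^2); df_r <- df.residual(reduced_sim)
sse_f <- sum(resid(full_sim)^2);    df_f <- df.residual(full_sim)
F_manual <- ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
p_manual <- pf(F_manual, df_r - df_f, df_f, lower.tail = FALSE)

# anova() reports the same F statistic and p-value in its second row
tab <- anova(reduced_sim, full_sim)
c(F_manual = F_manual, F_anova = tab$F[2])
c(p_manual = p_manual, p_anova = tab$`Pr(>F)`[2])
```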

Example 1 - Testing

  • Let’s determine which interactions are significant.
summary(m1)

full <- lm(average ~ year2013 + play_home + play_hours_2 + play_hours_3 + play_hours_4 + play_hours_5 +
           play_home:year2013 + # interaction between play_home and year2013
           play_home:play_hours_2 + play_home:play_hours_3 + play_home:play_hours_4 + play_home:play_hours_5, # interaction between play_home and play_hours
         data = analytic) 

reduced <- lm(average ~ year2013 + play_home + play_hours_2 + play_hours_3 + play_hours_4 + play_hours_5 +
           play_home:year2013, # interaction between play_home and year2013
         data = analytic) 

anova(reduced, full, test = "F")
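
An alternative way to set up the same kind of comparison, shown here only as a sketch: let R build the indicator columns with factor() and drop the interaction with update(). The data frame toy and its columns (home, hours, y) are simulated stand-ins, not the analytic data.

```r
set.seed(1)
toy <- data.frame(home  = rbinom(300, 1, 0.5),
                  hours = sample(1:3, 300, replace = TRUE))
toy$hours_2 <- as.integer(toy$hours == 2)  # hand-made indicator columns,
toy$hours_3 <- as.integer(toy$hours == 3)  # mirroring the dummy_cols() approach
toy$y <- 6 + 0.3 * toy$hours - 0.1 * toy$home + rnorm(300)

# Same model written two ways: explicit indicators vs. factor()
fit_dummy  <- lm(y ~ home + hours_2 + hours_3 + home:hours_2 + home:hours_3, data = toy)
fit_factor <- lm(y ~ home * factor(hours), data = toy)
all.equal(unname(coef(fit_dummy)), unname(coef(fit_factor)))

# Reduced model: remove every term belonging to the home x hours interaction
reduced_factor <- update(fit_factor, . ~ . - home:factor(hours))
anova(reduced_factor, fit_factor)
```

With factor(), the whole interaction is removed in one step, so there is less risk of leaving one of the c-1 terms behind in the reduced model.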

Example 1 - Testing

  • Hypotheses

    • H_0: \ \beta_{\text{year2013 $\times$ play\_home}} = 0
    • H_1: \ \beta_{\text{year2013 $\times$ play\_home}} \ne 0
  • Test Statistic and p-Value

    • t_0 = 9.63
    • p < 0.001
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0.

    • There is sufficient evidence to suggest that the relationship between average game rating and a minimum player count of 1 or 2 depends on if the game was made in the last 10 years or not.
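
The t test from summary() is enough here because a binary \times binary interaction contributes a single term, and for a single term the partial F statistic equals the squared t statistic. A small sketch on simulated data (recent, home, and rating are made-up stand-ins for year2013, play_home, and average):

```r
set.seed(2013)
n <- 500
recent <- rbinom(n, 1, 0.5)  # hypothetical binary predictor
home   <- rbinom(n, 1, 0.6)  # hypothetical binary predictor
rating <- 6 + 0.5 * recent - 0.1 * home + 0.3 * recent * home + rnorm(n, sd = 0.8)

fit_full <- lm(rating ~ recent + home + recent:home)
fit_red  <- lm(rating ~ recent + home)

# t statistic for the single interaction term, from the coefficient table
t_int <- coef(summary(fit_full))["recent:home", "t value"]

# Partial F statistic for removing that one term
F_int <- anova(fit_red, fit_full)$F[2]

c(t_squared = t_int^2, F = F_int)  # the two values agree
```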

Example 1 - Testing

  • Hypotheses

    • H_0: \ \beta_{\text{play\_home $\times$ play\_hours\_2}} = \beta_{\text{play\_home $\times$ play\_hours\_3}} = \beta_{\text{play\_home $\times$ play\_hours\_4}} = \beta_{\text{play\_home $\times$ play\_hours\_5}} = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • F_0 = 6.06
    • p < 0.001
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0.

    • There is sufficient evidence to suggest that the relationship between average game rating and a minimum player count of 1 or 2 depends on the length of game play.
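
To see what the two interactions imply on the rating scale, the sketch below rebuilds predicted average ratings for every combination of the three predictors directly from the coefficient estimates printed by summary(m1). The object names (b, hours_main, hours_home, grid) are illustrative; with the fitted model available, predict(m1, newdata = ...) would be the more usual route.

```r
# Coefficient estimates copied from the summary(m1) output
b <- c(intercept = 5.95235, year2013 = 0.45949, play_home = -0.05833,
       year2013_home = 0.30495)
hours_main <- c(0, 0.33735, 0.72737, 0.88137, 1.27010)   # play_hours 1..5 (1 = reference)
hours_home <- c(0, 0.17735, 0.06271, 0.02057, -0.43971)  # play_home:play_hours_k terms

# Every combination of the three predictors
grid <- expand.grid(play_hours = 1:5, play_home = 0:1, year2013 = 0:1)

grid$pred <- unname(
  b["intercept"] +
    b["year2013"] * grid$year2013 +
    b["play_home"] * grid$play_home +
    hours_main[grid$play_hours] +
    b["year2013_home"] * grid$year2013 * grid$play_home +
    hours_home[grid$play_hours] * grid$play_home
)

# Games made in 2013 or later: rows = play_hours, columns = play_home
recent <- subset(grid, year2013 == 1)
pred_table <- xtabs(pred ~ play_hours + play_home, data = recent)
round(pred_table, 2)

# Estimated play_home difference at each play_hours level
round(pred_table[, "1"] - pred_table[, "0"], 3)
```

If the play_home difference were the same in every row, there would be no play_home \times play_hours interaction.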

Example 1 - Data Visualization

Example 2 - Model

  • Let’s now model the average rating as a function of whether the game was made in the last 10 years (year2013), whether I can play it at home (play_home), the length of game play (play60 - continuous!), the interaction between play_home and year2013, and the interaction between play_home and play60.
m2 <- lm(average ~ year2013 + play_home + play60 + # main effects
           play_home:year2013 + # interaction between play_home and year2013
           play_home:play60, # interaction between play_home and play60
         data = analytic) 
summary(m2)

Call:
lm(formula = average ~ year2013 + play_home + play60 + play_home:year2013 + 
    play_home:play60, data = analytic)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9403 -0.4539  0.0416  0.5087  2.8929 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.750282   0.027097 212.212   <2e-16 ***
year2013            0.478982   0.029195  16.406   <2e-16 ***
play_home          -0.028208   0.029200  -0.966    0.334    
play60              0.313638   0.019092  16.427   <2e-16 ***
year2013:play_home  0.300712   0.031547   9.532   <2e-16 ***
play_home:play60    0.005379   0.020078   0.268    0.789    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7751 on 19903 degrees of freedom
Multiple R-squared:  0.2601,    Adjusted R-squared:  0.2599 
F-statistic:  1399 on 5 and 19903 DF,  p-value: < 2.2e-16
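
With a binary \times continuous interaction, a convenient way to read the output is to write one fitted line per level of the binary variable. Plugging play_home = 0 and play_home = 1 into the estimates above (rounded to three decimals) gives:

```latex
\begin{aligned}
\text{play\_home} = 0: \quad \hat{y} &= 5.750 + 0.479\,\text{year2013} + 0.314\,\text{play60} \\
\text{play\_home} = 1: \quad \hat{y} &= (5.750 - 0.028) + (0.479 + 0.301)\,\text{year2013} + (0.314 + 0.005)\,\text{play60} \\
&= 5.722 + 0.780\,\text{year2013} + 0.319\,\text{play60}
\end{aligned}
```

The two play60 slopes differ by only 0.005 rating points per hour, which is consistent with the non-significant play_home:play60 term (p = 0.789).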

Example 2 - Model Visualization

Live coding!


Wrap Up

  • What we have learned about interactions holds true regardless of the type of modeling we are doing.

    • We may not explicitly talk about interactions in the future; however, it is fair to be asked to include them in models.
  • Today’s activity:

    • Continue to model using the penguin data.

    • You’ll now be including categorical interaction terms in the model.